4  Model Building and Selection

In this module, we explore strategies for building effective regression models. We cover: - Increasing model complexity to capture non-linear relationships and interactions, - Assessing model fit while balancing complexity, - Techniques for selecting parsimonious models.

4.1 Increasing model complexity to model complex relationships

4.1.1 Why increase model complexity?

We began this course with the simple linear regression model: a single continuous predictor with a straight-line effect. We then expanded to multiple predictors, where each predictor still had a straight-line effect. For reasons that will become clear, these are examples of so-called first-order models:

Key-term: First-order model
A linear regression where each predictor appears only once and to the first power (no squared terms or interactions). For one predictor: \[ E[Y] = \beta_0 + \beta_1 X. \] With multiple predictors, \(X_1, X_2, \dots, X_i\), this extends to \[E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i\]

This first-order structure is the simplest linear model we can make with \(X\). It is often a reasonable starting point, and can go a long way to modeling real-world phenomena (especially in the multiple regression case), but real data are often more complex. Real relationships between predictors and outcomes are often non-linear, or involve interactions between predictors. To capture these more complex relationships, we can increase model complexity by adding ‘higher-order’ and ‘interaction’ terms.

Example 1: Non-linear relationships in data

For example, fitting a straight line to the following data is not ideal, as the relationship between \(X\) and \(Y\) is curved (U-shaped), so there is always part of the relationship the line cannot capture:

4.1.2 Square, Cubic, and Higher-Order Univariate Models

We enrich the basic linear model by adding powers of a predictor, allowing the fitted relationship to bend. In the above case, the “U” shape suggests a quadratic (squared) term may be appropriate.

Key-term: Second-order (Quadratic) univariate model
\[E[Y] = \beta_0 + \beta_1X + \beta_2X^2\]

Example 2: Seeing how \(\beta_2\) bends the curve

Adjust the slider for the squared term to see how changing \(\beta_2\) adds curvature to the fitted relationship . Try to find a good fit - can you guess what model generated this data?.

4.1.2.1 Third-order (cubic) univariate model

\[ E[Y] = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 \]

allows more complex curvature with two changes in direction (an “S”-shape) that a quadratic term cannot capture.

Example 3: Seeing how \(\beta_3\) twists the curve

Adjust the slider for the cubic term to see how changing \(\beta_3\) adds an inflection point to the fitted relationship. Can you tune the parameters to recover the curve that generated this data?

4.1.2.2 N-th order univariate model

We can extend this process of deriving new terms by adding further powers of \(X\) - allowing ius to to fit arbitrarily complex curves: \[ E[Y] = \beta_0 + \beta_1X + \dots + \beta_nX^n. \]

Note that each term gets its own parameter (\(\beta_i\)). If we have \(n\) data points, a (n-1)^th-order model will have a parameter for each point, meaning it will fit the data perfectly!

Example 4: Fitting higher-order polynomial models

Higher-order terms increase the model’s flexibility, but also increase the risk of fitting random noise rather than meaningful structure. We will return to this important issue later in the module.

Note: The meaning of “linear” in linear models
Even though higher-order models can model nonlinear relationships between the outcome and predictors, they are still linear models because each parameter enters additively. “Linearity” here refers to the parameters, not the shape of the fitted curve.

4.1.2.3 Fitting higher-order univariate models in R

In R, higher-order polynomial terms can be included using the I() function to indicate ‘as is’ operations. For example, to fit a quadratic model:

lm(y ~ x + I(x^2), data = nonlinearData)

Call:
lm(formula = y ~ x + I(x^2), data = nonlinearData)

Coefficients:
(Intercept)            x       I(x^2)  
     2.6185      -3.3760       0.8218  

or using the poly() function:

lm(y ~ poly(x, 2), data = nonlinearData)

Call:
lm(formula = y ~ poly(x, 2), data = nonlinearData)

Coefficients:
(Intercept)  poly(x, 2)1  poly(x, 2)2  
     0.3493       0.3941       3.0208  

4.1.3 Interaction Models with Continuous Predictors

In ?sec-slr we saw a multiple regression model with two continuous predictors:

\[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2. \]

This again is a first-order model - each predictor appears only once, ‘as is’. Of course, we can add higher-order terms for each predictor separately (e.g., \(X_1^2\), \(X_2^3\)) to capture curvature in their individual effects as we did above. e.g. A second-order multivariate model with two predictors might look like: \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1^2 + \beta_4X_2^2. \]

Now however, we can also consider a different form of higher order term: what happens when we multiply predictors together? This gives us an interaction term - a term that combines two (or more) predictors.

Key-term: Interaction term
An interaction is a product of predictors that allows the effect of one predictor to depend on the level of another. For two continuous predictors, the second-order interaction model is: \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2. \] If \(\beta_3 = 0\) (i.e. as in the first-order model), predictors act independently; if \(\beta_3 \neq 0\), the slope for \(X_1\) varies with \(X_2\) and vice versa.

Example 5: Fitting an interaction model to the advertising dataset.
If we recall our plot of the advertising dataset from ?sec-mlr, our fist-order linear model did not fit the data in particular ways:

ads <- read.csv("Data/Advertising.csv")
ads_mod <- lm(Sales ~ TV + Radio, data = ads)

The plane does not fit the data at the edges - it overestimates sales when either Radio or TV advertising is low (separately), but underestimates sales when both are high. This suggests that the effect of increasing one type of advertising depends on the level of the other type - an interaction effect.

We can fit a second-order interaction model to capture this: \[ E[Sales] = \beta_0 + \beta_1TV + \beta_2Radio + \beta_3(TV \times Radio). \]

The fitted interaction model has a charateristic ‘saddle’ shape, which can better capture the data structure:

The interaction effect can also be visualised by fixing one predictor and plotting the relationship between the other predictor and the outcome. In this case, we can plot lines representing the relationship between TV advertising and Sales when Radio advertising is low ($0 spent), average (i.e. mean(ads$Radio)= 23.264 thousand dollars ), and high (i.e. max(ads$Radio)= $49.6 thousand dollars):

We can see that the relationship between TV advertising and Sales (represented by the slope of the regression lines) is different at different levels of Radio advertising. In particular, as radio advertising increases, the slope of the line increases, indicating that TV advertising has a larger effect on Sales when Radio advertising is also high.

When our first order model has more than two predictors, we can include interaction terms between any pair (or more) of predictors. For example, with three predictors \(X_1\), \(X_2\), and \(X_3\), a second-order interaction model would include interactions between each pair: \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_1X_2 + \beta_5X_1X_3 + \beta_6X_2X_3. \]

4.1.3.1 Higher-order interaction models

We can also consider higher-order interactions - those that include combinations of three or more predictors. For example, a third-order interaction model with three predictors would include the ‘three-way’ interaction term: \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_1X_2 + \beta_5X_1X_3 + \beta_6X_2X_3 + \beta_7X_1X_2X_3. \] The \(\beta_7\) parameter here captures how the interaction between two predictors changes depending on the level of the third predictor.

As in the case of higher-order univariate models, we can extend this idea to include both higher-order interaction terms and higher-order univariate terms for each predictor, allowing for very flexible modeling of complex relationships.

However, as before, increasing model complexity in this way raises the risk of overfitting and interpretability challenges, which we will discuss later in this module.


4.1.3.2 Fitting interaction models in R

In R, interaction terms can be included using the * operator in the formula. For example, to fit a second-order interaction model with two predictors:

lm(Y ~ X1*X2*X3, data = mydata)

This expands to include both main effects and the interaction term. To include only the interaction term without main effects, use the : operator. For example, if we only wanted the interaction between X1 and X2, along with the linear terms for X1, X2, and X3:

lm(Y ~ X1+X2+X3+X1:X2, data = mydata)

4.1.4 Interaction Models with Categorical Predictors

A categorical variable with \(k\) levels is represented by \(k-1\) dummy variables: \[ E[Y] = \beta_0 + \beta_1X + \gamma_1D_1 + \dots + \gamma_{k-1}D_{k-1}. \]

  • Each \(\gamma_j\) measures the difference between level \(j\) and the baseline.
  • Changing the baseline changes interpretations but not fitted values.

Categorical predictors may also interact with continuous predictors: \[ E[Y] = \beta_0 + \beta_1X + \gamma_1D_1 + \delta_1(XD_1), \] allowing each group to have its own slope.

Note
We saw in ?sec-mlr_cat that first order models with categorical predictors are equivalent to the familiar t-tests and ANOVA models from earlier statistics courses. Interaction models with both continuous and categorical predictors are similarly analogous to “ANCOVA” models.

4.2 Assessing model fit and model complexity

4.2.1 Why not increase model complexity?

In the previous section we saw how adding higher-order terms and interactions can help us capture complex relationships in data. However, increasing model complexity is not always beneficial. While more complex models can fit the training data (i.e. the data we use to fit our model) better, they may not generalise well to new data. This is because complex models can start to fit random noise in the training data, rather than the underlying signal. This phenomenon is known as overfitting.

In addition, more complex models can be harder to interpret, making it difficult to understand the relationships between predictors and the outcome. They may also be more sensitive to outliers and multicollinearity, leading to unstable parameter estimates.

A good model is no more complex than necessary to describe the main structure of the data. Therefore, when building regression models, we need to balance the trade-off between model fit and model complexity. In this section, we discuss some common issues that arise with complex models, and metrics for assessing model fit while accounting for complexity.

Key-point: Balancing fit and complexity
A good regression model balances: * Adequate fit to the data, * Simplicity and interpretability, * Generalisability to new data.

4.2.2 Fit Metrics

Fit metrics quantify how well a model captures patterns in the data. We have already seen one such metric: the coefficient of determination, \(R^2\).

Key-term: Coefficient of Determination (\(R^2\))
\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \] where \(SS_{res}\) is the residual sum of squares and \(SS_{tot}\) is the total sum of squares. \(R^2\) measures the proportion of variance in the outcome explained by the model, ranging from 0 (no explanatory power) to 1 (perfect fit).

\(R^2\) was introduced in ?sec-slr as a measure of fit for simple linear regression models based soely on the proportion of variance in \(Y\) explained by \(X\). We can also calculate \(R^2\) for multiple regression models, where it measures the proportion of variance in \(Y\) explained by all predictors jointly. However, \(R^2\) has a key limitation: it always increases (or at least does not decrease) when additional predictors are added to the model, even if those predictors do not meaningfully improve the model’s explanatory power. This can lead to overfitting, where a model appears to fit the training data well but performs poorly on new data - for example, the (n-1)th order polynomial model discussed in ?sec-ho_univariate has A SSres of zero and thus an \(R^2\) of 1, but is unlikely to generalise well.

4.2.2.1 Adjusted \(R^2\)

To address this limitation, we can adjust \(R^2\) to penalise the addition of unnecessary predictors. The adjusted \(R^2\) introduces a penalty term based on the number of predictors and the sample size, providing a more balanced measure of model fit that accounts for complexity.

Key-term: Adjusted \(R^2\)
\[ \text{Adjusted } R^2 = 1 - \left(1 - R^2\right) \frac{n - 1}{n - k - 1} \] where \(n\) is the sample size and \(k\) is the number of predictors. Adjusted \(R^2\) can decrease when unnecessary predictors are added, making it a more reliable metric than \(R^2\) for model selection. Can be computed in R using summary(lm_model)$adj.r.squared.

Adjusted \(R^2\) is particulary useful because of it’s standardised scale (0 to 1), and the interpretation of \(R^2\) as the proportion of variance explained still holds - now with the added benefit of penalising unnecessary complexity. However, it is important to note that adjusted \(R^2\) is just one of many metrics available for assessing model fit while accounting for complexity.

4.2.2.2 Information Criteria: AIC, BIC

Another common approach to balancing model fit and complexity is to use information criteria, such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These criteria provide a quantitative way to compare models, penalising those with more parameters.

Key-term: Akaike Information Criterion (AIC)
\[ \text{AIC} = 2k - 2\ln(L) \] where \(k\) is the number of parameters in the model and \(L\) is the likelihood of the model. Lower AIC values indicate a better balance between fit and complexity. Can be computed in R using AIC() function.

Key-term: Bayesian Information Criterion (BIC)
\[ \text{BIC} = \ln(n)k - 2\ln(L) \] where \(n\) is the sample size, \(k\) is the number of parameters, and \(L\) is the likelihood of the model. Like AIC, lower BIC values indicate a better balance between fit and complexity. Can be computed in R using BIC() function.

Both AIC and BIC penalise model complexity, but BIC generally imposes a larger penalty for additional parameters, especially in larger samples. The choice between AIC and BIC often depends on the context and goals of the analysis. AIC is generally preferred for predictive accuracy, while BIC is more conservative and favours simpler models.

Key-point: Interpreting AIC and BIC
Unlike adjusted \(R^2\), AIC and BIC do not have a fixed scale, so their absolute values are not interpretable on their own. Instead, they are used to compare models fitted to the same dataset - the model with the lowest AIC or BIC is considered the best among the candidates.

4.2.2.3 Calculating fit metrics with broom::glance()

In R, we can calculate adjusted \(R^2\), AIC, and BIC using the broom package’s glance() function, which provides a tidy summary of model fit statistics. For example, using our fitted 2nd-order interaction model of the advertising dataset:

library(broom)
glance(ads_mod_int)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.968         0.967 0.944     1963. 6.68e-146     3  -270.  550.  567.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Note that glance() returns a data frame with the mentioned fit metrics, along with other useful statistics such as the number of observations (nobs), residual standard error (sigma) and the global F-statistic (statistic). Alongside tidy(), glance() is a powerful tool for summarising and comparing regression models in R.

4.2.2.4 Comparing model fit

When comparing multiple models fitted to the same dataset, we can use adjusted \(R^2\), AIC, and BIC to assess which model provides the best balance between fit and complexity: the key points to remember are:

  • Higher adjusted \(R^2\) values indicate better fit.
  • Lower AIC and BIC values indicate better fit.

Example 6: Comparing first-order and interaction models
Lets continue our investigation of the advertising dataset from ?sec-mlr by comparing our first-order model of Sales (ads_mod) to our second-order interaction model (ads_mod_int).

Lets combine the fit metrics from both models into a single table for easy comparison:

# A tibble: 2 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
*     <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.897         0.896 1.68       860. 4.83e- 98     2  -386.  780.  794.
2     0.968         0.967 0.944     1963. 6.68e-146     3  -270.  550.  567.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
ads_fit_glance <- rbind(
  glance(ads_mod) |> `rownames<-`("First-order model"),
  glance(ads_mod_int) |> `rownames<-`("Interaction model"))

ads_fit_glance

Since the interaction model ?eq-ads_int differs from the first-order model ?eq-ads_mod by the addition of only the \(TV \times Radio\) interaction term, we can interpret the change in fit metrics as the change in model fit due to adding this term. Lets go through each fit metric in turn: - \(R^2\): Since the RSE (sigma) decreases, \(R^2\) increases from approximately 0.897 to 0.915, indicating that the interaction model explains more variance in Sales than the first-order model (as expected). Moreover, - Adjusted \(R^2\) also increases a similar ammount, suggesting that the improvement in fit is substantial enough to outweigh the penalty for adding an additional parameter. - AIC decreases from approximately 315.6 to 308.3, indicating that the interaction model provides a better balance between fit and complexity. - BIC also decreases from approximately 322.2 to 317.4, further supporting the conclusion that the interaction model is preferable.

This seems pretty clearcut: the interaction model provides a much better fit to the data (as we saw in our plots of ?sec-ads_int), and the fit metrics all indicate that the improvement in fit justifies the added complexity of the interaction term. But can we improve our model further by adding more terms?

Lets fit the ‘full second-order model’ including all main effects and the interaction term, as well as quadratic terms for both TV and Radio advertising:

ads_mod_full <- lm(Sales ~ TV + Radio + I(TV^2) + I(Radio^2) + TV:Radio, data = ads)

ads_fit_glance <- ads_fit_glance |> rbind(
  glance(ads_mod_full) |> `rownames<-`("Full second-order model")
)
ads_fit_glance
# A tibble: 3 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
*     <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.897         0.896 1.68       860. 4.83e- 98     2  -386.  780.  794.
2     0.968         0.967 0.944     1963. 6.68e-146     3  -270.  550.  567.
3     0.986         0.986 0.624     2740. 8.17e-178     5  -187.  387.  410.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

once again, we see that the fit metrics all improve. Note that at this stage, the model has become difficult to interpret - we have lost the simple meaning of the parameters as marginal effects of each predictor. Now, the effect of each predictor depends on both its own level, its interaction with the other predictor, and the squared terms all at once.

Nevertheless, we can continue to add more terms, lets fit a third-order model including cubic terms for both predictors:

ads_mod_cubic <- lm(Sales ~ TV + Radio + I(TV^2) + I(Radio^2) + I(TV^3) + I(Radio^3) + TV:Radio, data = ads)
ads_fit_glance |> rbind(
  glance(ads_mod_cubic) |> `rownames<-`("Cubic model"))
# A tibble: 4 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
*     <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.897         0.896 1.68       860. 4.83e- 98     2  -386.  780.  794.
2     0.968         0.967 0.944     1963. 6.68e-146     3  -270.  550.  567.
3     0.986         0.986 0.624     2740. 8.17e-178     5  -187.  387.  410.
4     0.991         0.991 0.503     3027. 9.19e-193     7  -142.  303.  333.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Here, we see that while \(R^2\) and adjusted \(R^2\) continue

4.3 A model building workflow

4.3.1 Exploratory data analysis and theory-driven model development

4.3.2 Automated model selection methods

4.3.3 Stepwise Regression

Stepwise methods provide automated ways to search for simpler models.

4.3.3.1 Forward Selection

Begin with a minimal model (often the intercept). Add predictors one at a time when they improve AIC or adjusted \(R^2\).

4.3.3.2 Backward Selection

Begin with a saturated model containing all candidate predictors. Remove predictors that do not meaningfully contribute.

Note
Stepwise procedures should be treated as screening tools, not definitive modelling strategies. Final model decisions should consider diagnostics, interpretability, and subject-matter knowledge.


4.4 Exercises

Exercise 1

A scatterplot of \(Y\) vs. \(X\) shows curvature.

  1. Fit a straight-line model and inspect residuals.

  2. Fit a model including \(X^2\).

  3. Compare AIC and adjusted \(R^2\) between the models.

  4. Discuss whether the added complexity is justified.

    :::